{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "365d40b9-557e-41b5-8afc-c89154c3dacc",
   "metadata": {},
   "source": [
    "# Read data and get beta values\n",
    "\n",
    "First, import the necessary packages from pylluminator, and set the logger level to your convenience."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39701715-251b-4290-8c4d-7c1d0632d6a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pylluminator.annotations import Annotations, ArrayType, GenomeVersion\n",
    "from pylluminator.samples import Samples, read_samples\n",
    "from pylluminator.visualizations import betas_density, betas_dendrogram, betas_2D, nb_probes_per_chr_and_type_hist, methylation_distribution\n",
    "from pylluminator.utils import set_logger, download_from_geo\n",
    "\n",
    "# path to your data, or a folder where the downloaded data will be saved\n",
    "data_path = '~/data/pylluminator-tutorial/' \n",
    "\n",
    "# set the verbosity level, can be DEBUG, INFO, WARNING, ERROR\n",
    "set_logger('WARNING')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60f1c60c-ee6c-4d20-8bcc-13797f01703a",
   "metadata": {},
   "source": [
    "## Read .idat files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fc887eb-30fb-4f82-96f1-bbfbc6245b67",
   "metadata": {},
   "source": [
    "### Download test data\n",
    "\n",
    "For the purpose of this tutorial, you can download data from 'GEO database <https://www.ncbi.nlm.nih.gov/geo/>', or skip this step if you want to use you own data. \n",
    "\n",
    "We downloaded 3 samples from healthy prostate cells (PrEC) and 3 samples from prostate cancer cell (LNCaP)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e420427-a09d-4b64-9638-d6fd6034f7b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "download_from_geo(['GSM7698438', 'GSM7698446', 'GSM7698462', 'GSM7698435', 'GSM7698443', 'GSM7698459'], data_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b47ce9d8-162f-444e-95d9-ad58e99b0c72",
   "metadata": {},
   "source": [
    "### Define and read annotation\n",
    "First, define the array type of your data and the genome version in order to read the associated information files (manifest, masks, genome information...)\n",
    "\n",
    "The first run of this code might take a while, as the script will download the manifest and genome info files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24073c9e-ee40-491e-bc00-66c4256eb4f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# define those parameters to match your data\n",
    "array_type = ArrayType.HUMAN_EPIC_V2  # run ArrayType? to see available types\n",
    "genome_version = GenomeVersion.HG38  # run GenomeVersion? to see available genome versions\n",
    "\n",
    "# read annotation\n",
    "annos = Annotations(array_type, genome_version)\n",
    "\n",
    "#-------------------------------------------------------------\n",
    "\n",
    "# alternative: if you don't know what your data is, you can set annos to None and let the script detect the array version when reading samples\n",
    "# annos = None"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c899b36b-d0f8-4188-b278-ede0fa452dab",
   "metadata": {},
   "source": [
    "### Read samples .idat\n",
    "\n",
    "Now set the paths to your data and to the sample sheet if you have one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a59984be-fd86-4ecd-b4e6-5411ee905442",
   "metadata": {},
   "outputs": [],
   "source": [
    "# minimum number of detected beads per probe\n",
    "min_beads = 4\n",
    "\n",
    "# if no sample sheet is provided, it will be rendered automatically. Use parameter sample_sheet_path or sample_sheet_name to specify one.\n",
    "my_samples = read_samples(data_path, annotation=annos, min_beads=min_beads, keep_idat=True)  \n",
    "\n",
    "# let's check the generated sample sheet !\n",
    "my_samples.sample_sheet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80cad892-c1b1-4d82-b22e-0d3819c64a35",
   "metadata": {},
   "source": [
    "Now is a good time to check if any sample sheet value has the wrong type. Having a category value encoded as a number can lead to unexpected results, so be sure to convert them to string if need (e.g. the sample ID or sentrix ID)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed56b27e-ea81-42c4-a082-a9742108fba3",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.sample_sheet.dtypes\n",
    "# we're all good here, since sentrix_id and sentrix_position are unknown - but if needed here is an example of how to change a column type:\n",
    "# my_samples = my_samples.astype({'sentrix_id': str})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daf418fc-855e-46e6-b030-3c2fbda10348",
   "metadata": {},
   "source": [
    "Example of the signal dataframe for a sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ccd57c8-5188-4dff-85d8-544a5e4c92b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.get_signal_df()['PREC_500_3']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a77c698d-bb92-4d15-be3b-c5979064fcc3",
   "metadata": {},
   "source": [
    "Let's add a custom column to our sample sheet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30f4fdbc-2cab-4d71-b6c0-358c5a47d61c",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.sample_sheet['sample_type'] = [n.split('_')[0] for n in my_samples.sample_sheet.sample_name]\n",
    "my_samples.sample_sheet"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ccb2305-df16-4322-b918-55ac29e832a3",
   "metadata": {},
   "source": [
    "### Save pylluminator samples (optional)\n",
    "\n",
    "As reading lots of samples can take a couple of minutes, you might want to save the Samples object to a file for later use. Here is how."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e57702cf-a320-48fe-bf7f-a4b5b2389742",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.save('raw_samples')\n",
    "\n",
    "# For later use, here is how to load samples :\n",
    "# my_samples = Samples.load('raw_samples')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa74608f-82f3-4bc9-b912-21c0e7a63eb3",
   "metadata": {},
   "source": [
    "## Calculate and plot beta values\n",
    "\n",
    "Once your samples are loaded, you can already calculate and plot the beta values of the raw data :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee643cd7-4d10-43a1-b77c-3b89d6d9cf19",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.calculate_betas(include_out_of_band=True)\n",
    "betas_density(my_samples, color_column='sample_type')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85ed3eb8-9b22-4719-bb88-041bda1b806a",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "\n",
    "Here is the classical preprocessing pipeline for human samples. \n",
    "\n",
    "Note that each step modifies the Samples object directly, so it's useful to save the raw samples first (see step 1.4) if you want to try different preprocessing methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1222bedc-2125-489a-80fa-a4a813e38363",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.mask_quality_probes()\n",
    "my_samples.infer_type1_channel()\n",
    "my_samples.dye_bias_correction_nl()\n",
    "my_samples.poobah()\n",
    "my_samples.noob_background_correction()\n",
    "my_samples.calculate_betas(include_out_of_band=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0fcbf40-3d09-4ada-8e74-57e635560b39",
   "metadata": {},
   "outputs": [],
   "source": [
    "my_samples.save('preprocessed_samples')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70fe6a34-4603-4b65-8b39-a9451d988795",
   "metadata": {},
   "source": [
    "Now let's see what the preprocessing changed in our beta values!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f96de58b-be13-430e-bb6d-62a3494132ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "betas_density(my_samples, color_column='sample_type')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9597af72-0a29-41fc-8ebc-20e9775aa800",
   "metadata": {},
   "source": [
    "## Data insights"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "403b26e0-41a0-478c-a4a9-180a813d3c82",
   "metadata": {},
   "source": [
    "There are a few more plots you can already check to get to know your data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6bf8156-c074-4879-b884-8163413c83e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "nb_probes_per_chr_and_type_hist(my_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f03975d-f7a5-44d4-8f99-b6f7d659aaf2",
   "metadata": {},
   "source": [
    "Using a principal component analysis (PCA), we can visualize the spatial distribution of the samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d9d7650-9555-41b7-948d-378bd1cb6648",
   "metadata": {},
   "outputs": [],
   "source": [
    "betas_2D(my_samples, model='PCA', color_column='sample_type')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "193370b3-76b4-4584-9c94-ac2e26e48376",
   "metadata": {},
   "source": [
    "The hierarchical clustering of the samples - nothing really surprising in the output :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3dd1764c-956f-4c5f-b06f-26fe517895e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "betas_dendrogram(my_samples)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8df3fe48",
   "metadata": {},
   "source": [
    "And finally, check the distribution of hypo- and hyper- methylated probes, depending on the probes location"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4acd8401",
   "metadata": {},
   "outputs": [],
   "source": [
    "methylation_distribution(my_samples, group_column='sample_type', figsize=(8, 5))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pylluminator",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}